Abstract



Many combinatorial problems arising in machine learning can be reduced to the problem 
of minimizing a submodular function. Submodular functions are a natural discrete analog 
of convex functions, and can be minimized in strongly polynomial time. Unfortunately, 
state-of-the-art algorithms for general submodular minimization are intractable for larger 
problems. In this paper, we introduce a novel subclass of submodular minimization 
problems that we call decomposable. Decomposable submodular functions are those 
that can be represented as sums of concave functions applied to modular functions. We 
develop an algorithm, SLG, that can efficiently minimize decomposable submodular 
functions with tens of thousands of variables. Our algorithm exploits recent results in 
smoothed convex minimization. We apply SLG to synthetic benchmarks and a joint 
classification-and-segmentation task, and show that it outperforms the state-of-the-art 
general purpose submodular minimization algorithms by several orders of magnitude. 



1 Introduction 

Convex optimization has become a key tool in many machine learning algorithms. Many seemingly 
multimodal optimization problems such as nonlinear classification, clustering and dimensionality 
reduction can be cast as convex programs. When minimizing a convex loss function, we can rest
assured that an optimal solution can be found efficiently, even for large problems. Convexity is a
structural property of continuous optimization problems. However, many machine learning problems,
such as structure learning, variable selection, and MAP inference in discrete graphical models,
require solving discrete, combinatorial optimization problems.

In recent years, another fundamental problem structure, which has similar beneficial properties, 
has emerged as very useful in many combinatorial optimization problems arising in machine learning:
submodularity is an intuitive diminishing returns property, stating that adding an element to a
smaller set helps more than adding it to a larger set. Similarly to convexity, submodularity allows
one to efficiently find provably (near-)optimal solutions. In particular, the minimum of a submodular
function can be found in strongly polynomial time [11]. Unfortunately, while polynomial-time solvable,
exact techniques for submodular minimization require a number of function evaluations on the
order of $n^5$ [12], where $n$ is the number of variables in the problem (e.g., number of random variables
in the MAP inference task), rendering the algorithms impractical for many real-world problems.

Fortunately, several submodular minimization problems arising in machine learning have structure 
that allows solving them more efficiently. Examples include symmetric functions that can be 
solved in $O(n^3)$ evaluations using Queyranne's algorithm [19], and functions that decompose into
attractive, pairwise potentials, that can be solved using graph cutting techniques [7]. In this paper, 
we introduce a novel class of submodular minimization problems that can be solved efficiently. In 
particular, we develop an algorithm SLG, that can minimize a class of submodular functions that 
we call decomposable: These are functions that can be decomposed into sums of concave functions 
applied to modular (additive) functions. Our algorithm is based on recent techniques of smoothed
convex minimization [18] applied to the Lovász extension. We demonstrate the usefulness of
our algorithm on a joint classification-and-segmentation task involving tens of thousands of
variables, and show that it outperforms state-of-the-art algorithms for general submodular function 
minimization by several orders of magnitude. 



2 Background on Submodular Function Minimization 

We are interested in minimizing set functions that map subsets of some base set $E$ to real numbers.
That is, given $f : 2^E \to \mathbb{R}$, we wish to solve for $A^* \in \arg\min_A f(A)$. For simplicity of notation, we
use the base set $E = \{1, \dots, n\}$, but in an application the base set may consist of nodes of a graph,
pixels of an image, etc. Without loss of generality, we assume $f(\emptyset) = 0$. If the function $f$ has no
structure, then there is no way to solve the problem other than checking all $2^n$ subsets. In this paper,
we consider functions that satisfy a key property that arises in many applications: submodularity
(cf. [16]). A set function $f$ is called submodular iff, for all $A, B \in 2^E$, we have

$$f(A \cup B) + f(A \cap B) \le f(A) + f(B). \quad (1)$$

Submodular functions can alternatively, and perhaps more intuitively, be characterized in terms of
their discrete derivatives. First, we define $\Delta_k f(A) = f(A \cup \{k\}) - f(A)$ to be the discrete derivative
of $f$ with respect to $k \in E$ at $A$; intuitively this is the change in $f$'s value by adding the element $k$
to the set $A$. Then, $f$ is submodular iff:

$$\Delta_k f(A) \ge \Delta_k f(B), \text{ for all } A \subseteq B \subseteq E \text{ and } k \in E \setminus B.$$

Note the analogy to concave functions; the discrete derivative is smaller for larger sets, in the same
way that $\phi(x+h) - \phi(x) \ge \phi(y+h) - \phi(y)$ for all $x \le y$, $h \ge 0$ if and only if $\phi$ is a concave function
on $\mathbb{R}$. Thus a simple example of a submodular function is $f(A) = \phi(|A|)$ where $\phi$ is any concave
function. Yet despite this connection to concavity, it is in fact 'easier' to minimize a submodular
function than to maximize it¹, just as it is easier to minimize a convex function. One explanation for
this is that submodular minimization can be reformulated as a convex minimization problem.
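
The diminishing-returns characterization can be checked by brute force on a small base set. The following is only an illustrative sketch: the helper names `subsets` and `is_submodular` and the two example functions are assumptions introduced here, not part of the paper, and the check enumerates all pairs of subsets, so it is only feasible for very small $n$.

```python
from itertools import combinations
import math


def subsets(ground):
    """Yield every subset of `ground` as a frozenset."""
    items = list(ground)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)


def is_submodular(f, ground, tol=1e-12):
    """Check Delta_k f(A) >= Delta_k f(B) for all A <= B and k outside B."""
    all_sets = list(subsets(ground))
    for A in all_sets:
        for B in all_sets:
            if not A <= B:
                continue
            for k in ground - B:
                gain_A = f(A | {k}) - f(A)   # discrete derivative at A
                gain_B = f(B | {k}) - f(B)   # discrete derivative at B
                if gain_A < gain_B - tol:
                    return False
    return True


E = frozenset(range(5))
concave_card = lambda A: math.sqrt(len(A))   # phi(|A|) with concave phi
convex_card = lambda A: float(len(A) ** 2)   # convex in |A|, not submodular
```

Calling `is_submodular(concave_card, E)` exercises the example $f(A) = \phi(|A|)$ from the text, while `convex_card` serves as a counterexample.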

To see this, consider taking a set function minimization problem, and reformulating it as a
minimization problem over the unit cube $[0,1]^n \subset \mathbb{R}^n$. Define $e_A \in \mathbb{R}^n$ to be the indicator vector of the
set $A$, i.e.,

$$e_A[k] = \begin{cases} 0 & \text{if } k \notin A \\ 1 & \text{if } k \in A \end{cases}$$

We use the notation $x[k]$ for the $k$th element of the vector $x$. Also we drop brackets and commas
in subscripts, so $e_{kl} = e_{\{k,l\}}$ and $e_k = e_{\{k\}}$ as with the standard unit vectors. A continuous
extension of a set function $f$ is a function $\tilde f$ on the unit cube, $\tilde f : [0,1]^n \to \mathbb{R}$, with the property
that $f(A) = \tilde f(e_A)$. In order to be useful, however, one needs the minima of the set function to be
related to minima of the extension:

$$A^* \in \arg\min_{A \in 2^E} f(A) \;\Rightarrow\; e_{A^*} \in \arg\min_{x \in [0,1]^n} \tilde f(x). \quad (2)$$

A key result due to Lovász [16] states that each submodular function $f$ has an extension $\tilde f$ that not
only satisfies the above property, but is also convex and efficient to evaluate. We can define the
Lovász extension in terms of the submodular polyhedron $P_f$:

$$P_f = \{v \in \mathbb{R}^n : v \cdot e_A \le f(A) \text{ for all } A \in 2^E\}, \qquad \tilde f(x) = \sup_{v \in P_f} v \cdot x.$$

The submodular polyhedron $P_f$ is defined by exponentially many inequalities, and evaluating $\tilde f$
requires solving a linear program over this polyhedron. Perhaps surprisingly, as shown by Lovász, $\tilde f$
can be very efficiently computed as follows. For a fixed $x$ let $\sigma : E \to E$ be a permutation such that
$x[\sigma(1)] \ge \dots \ge x[\sigma(n)]$, and then define the set $S_k = \{\sigma(1), \dots, \sigma(k)\}$. Then we have a formula
for $\tilde f$ and a subgradient:

$$\tilde f(x) = \sum_{k=1}^{n} x[\sigma(k)]\,(f(S_k) - f(S_{k-1})), \qquad \partial \tilde f(x) \ni \sum_{k=1}^{n} e_{\sigma(k)}\,(f(S_k) - f(S_{k-1})).$$

Note that if two components of $x$ are equal, the above formula for $\tilde f$ is independent of the
permutation chosen, but the subgradient is not unique.
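
The sorting formula above translates almost directly into code. The sketch below is a minimal illustration under stated assumptions: elements are indexed `0..n-1` rather than `1..n`, the set function takes a `frozenset`, `f(empty) = 0`, and the name `lovasz_extension` is introduced here for illustration only.

```python
def lovasz_extension(f, x):
    """Return (value, subgradient) of the Lovasz extension of f at x.

    f maps a frozenset of indices 0..n-1 to a number, with f(empty) = 0.
    """
    n = len(x)
    # sigma: indices ordered by decreasing x (ties broken by index)
    order = sorted(range(n), key=lambda k: -x[k])
    value = 0.0
    subgrad = [0.0] * n
    prefix = frozenset()
    f_prev = f(prefix)
    for k in order:
        prefix = prefix | {k}
        f_curr = f(prefix)
        marginal = f_curr - f_prev      # f(S_k) - f(S_{k-1})
        value += x[k] * marginal
        subgrad[k] = marginal
        f_prev = f_curr
    return value, subgrad


# Example: a concave function of |A|, hence submodular.
f = lambda A: float(min(2, len(A)))
val_corner, _ = lovasz_extension(f, [1.0, 0.0, 1.0, 0.0])   # at e_A, A = {0, 2}
val_inner, g_inner = lovasz_extension(f, [0.5, 0.2, 0.9, 0.0])
```

At an indicator vector the extension agrees with the set function, so `val_corner` equals `f({0, 2})`; at an interior point the value equals the inner product of the returned subgradient with `x`.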



¹ With the additional assumption that $f$ is nondecreasing, maximizing a submodular function subject to a
cardinality constraint $|A| \le M$ is 'easy'; a greedy algorithm is known to give a near-optimal answer [17].

Equation (2) was used to show that submodular minimization can be achieved in polynomial time
[16]. However, algorithms which directly minimize the Lovász extension are regarded as impractical.
Despite being convex, the Lovász extension is non-smooth, and hence a simple subgradient
descent algorithm would need $O(1/\epsilon^2)$ steps to achieve $O(\epsilon)$ accuracy.

Recently, Nesterov showed that if knowledge about the structure of a particular non-smooth convex
function is available, it can be exploited to achieve a running time of $O(1/\epsilon)$ [18]. One way this is
done is to construct a smooth approximation of the non-smooth function, and then use an accelerated
gradient descent algorithm which is highly effective for smooth functions. Connections of this work
with submodularity and combinatorial optimization are also explored in [4] and [2]. In fact, in
[2], Bach shows that computing the smoothed Lovász gradient of a general submodular function is
equivalent to solving a submodular minimization problem. In this paper, we do not treat general
submodular functions, but rather a large class of submodular functions that we call
decomposable. (To apply the smoothing technique of [18], special structural knowledge about the
convex function is required, so it is natural that we would need special structural knowledge about 
the submodular function to leverage those results.) We further show that we can exploit the discrete 
structure of submodular minimization in a way that allows terminating the algorithm early with a 
certificate of optimality, which leads to drastic performance improvements. 

3 The Decomposable Submodular Minimization Problem 

In this paper, we consider the problem of minimizing functions of the following form:

$$f(A) = c \cdot e_A + \sum_j \phi_j(w_j \cdot e_A) \quad (3)$$

where $c, w_j \in \mathbb{R}^n$ and $0 \le w_j \le \mathbf{1}$ and $\phi_j : [0, w_j \cdot \mathbf{1}] \to \mathbb{R}$ are arbitrary concave functions. It is
shown in the Appendix that functions of this form are submodular. We call this class of functions
decomposable submodular functions, as they decompose into a sum of concave functions applied to
nonnegative modular functions². Below, we give examples of decomposable submodular functions
arising in applications.

We first focus on the special case where all the concave functions are of the form
$\phi_j(\cdot) = d_j \min(y_j, \cdot)$ for some $y_j, d_j > 0$. Since these potentials are of key importance, we define the
submodular functions $\Psi_{w,y}(A) = \min(y, w \cdot e_A)$ and call them threshold potentials. In Section 5,
we will show how to generalize our approach to arbitrary decomposable submodular functions.

Examples. The simplest example is a 2-potential, which has the form $\phi(|A \cap \{k,l\}|)$, where
$\phi(1) - \phi(0) \ge \phi(2) - \phi(1)$. It can be expressed as a sum of a modular function and a threshold potential:

$$\phi(|A \cap \{k,l\}|) = \phi(0) + (\phi(2) - \phi(1))\, e_{kl} \cdot e_A + (2\phi(1) - \phi(0) - \phi(2))\, \Psi_{e_{kl},1}(A)$$

Why are such potential functions interesting? They arise, for example, when finding the Maximum
a Posteriori configuration of a pairwise Markov Random Field model in image classification
schemes such as in [20]. On a high level, such an algorithm computes a value $c[k]$ that corresponds
to the log-likelihood of pixel $k$ being of one class vs. another, and for each pair of adjacent pixels,
a value $d_{kl}$ related to the log-likelihood that pixels $k$ and $l$ are of the same class. Then the algorithm
classifies pixels by minimizing a sum of 2-potentials: $f(A) = c \cdot e_A + \sum_{k,l} d_{kl}(1 - |1 - e_{kl} \cdot e_A|)$.
If the value $d_{kl}$ is large, this encourages the pixels $k$ and $l$ to be classified similarly.

More generally, consider a higher order potential function: a concave function of the number of
elements in some activation set $S$, $\phi(|A \cap S|)$ where $\phi$ is concave. It can be shown that this can
be written as a sum of a modular function and a positive linear combination of $|S| - 1$ threshold
potentials. Recent work [14] has shown that classification performance can be improved by adding
terms corresponding to such higher order potentials $\phi_j(|R_j \cap A|)$ to the objective function where the
functions $\phi_j$ are piecewise linear concave functions, and the regions $R_j$ of various sizes are generated
from a segmentation algorithm. Minimization of these particular potential functions can then be
reformulated as a graph cut problem [13], but this is less general than our approach.

Another canonical example of a submodular function is a set cover function. Such a function can
be reformulated as a combination of concave cardinality functions (details in appendix). So all
functions which are weighted combinations of set cover functions can be expressed as threshold
potentials. However, threshold potentials with nonuniform weights are strictly more general than
concave cardinality potentials. That is, there exist $w$ and $y$ such that $\Psi_{w,y}(A)$ cannot be expressed
as $\sum_j \phi_j(|R_j \cap A|)$ for any collection of concave $\phi_j$ and sets $R_j$.

² A function is called modular if (1) holds with equality. It can be written as $A \mapsto w \cdot e_A$ for some $w \in \mathbb{R}^n$.
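
As a sanity check on the Examples above, the decomposition of a 2-potential into a constant, a modular term and one threshold potential can be verified numerically. This is only a sketch: the function names, the 0-based indexing, and the encoding of $\phi$ as a list `[phi(0), phi(1), phi(2)]` are assumptions made for illustration.

```python
def threshold(w, y, A):
    """Threshold potential Psi_{w,y}(A) = min(y, w . e_A)."""
    return min(y, sum(w[k] for k in A))


def two_potential_direct(phi, k, l, A):
    """phi(|A intersect {k, l}|), with phi given as [phi(0), phi(1), phi(2)]."""
    return phi[len(A & {k, l})]


def two_potential_decomposed(phi, k, l, n, A):
    """Constant + modular term + nonnegative multiple of a threshold potential."""
    e_kl = [1.0 if i in (k, l) else 0.0 for i in range(n)]
    modular = (phi[2] - phi[1]) * sum(e_kl[i] for i in A)
    weight = 2 * phi[1] - phi[0] - phi[2]   # nonnegative when phi is concave
    return phi[0] + modular + weight * threshold(e_kl, 1.0, A)


phi = [0.0, 1.5, 2.0]   # satisfies phi(1) - phi(0) >= phi(2) - phi(1)
```

Comparing the two functions over all subsets of a small base set confirms that they agree for every value of $|A \cap \{k,l\}|$.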

Another example of decomposable functions arises in multiclass queuing systems [10]. These are
of the form $f(A) = c \cdot e_A + (u \cdot e_A)\,\phi(v \cdot e_A)$, where $u, v$ are nonnegative weight vectors and $\phi$ is
a nonpositive nonincreasing concave function. With the proper choice of $\phi_j$ and $w_j$ (again details
are in appendix), this can in fact be reformulated as a sum of the type in Eq. 3 with $n$ terms.

In our own experiments, shown in Section 6, we use an implementation of TextonBoost [20] and 
augment it with quadratic higher order potentials. That is, we use TextonBoost to generate per-pixel 
scores $c$, and then minimize $f(A) = c \cdot e_A + \sum_j |A \cap R_j|\,|R_j \setminus A|$, where the regions $R_j$ are regions
of pixels that we expect to be of the same class (e.g., by running a cheap region-growing heuristic).
The potential function $|A \cap R_j|\,|R_j \setminus A|$ is smallest when $A$ contains all of $R_j$ or none of it. It gives the
largest penalty when exactly half of Rj is contained in A. This encourages the classification scheme 
to classify most of the pixels in a region Rj the same way. We generate regions with a basic region- 
growing algorithm with random seeds. See Figure 1(a) for an illustration of examples of regions 
that we use. In our experience, this simple idea of using higher-order potentials can dramatically 
increase the quality of the classification over one using only 2-potentials, as can be seen in Figure 2. 

4 The SLG Algorithm for Threshold Potentials

We now present our algorithm for efficient minimization of a decomposable submodular function $f$
based on smoothed convex minimization. We first show how we can efficiently smooth the Lovász
extension of $f$. We then apply accelerated gradient descent to the gradient of the smoothed function.
Lastly, we demonstrate how we can often obtain a certificate of optimality that allows us to stop 
early, drastically speeding up the algorithm in practice. 

4.1 The Smoothed Extension of a Threshold Potential 

The key challenge in our algorithm is to efficiently smooth the Lovász extension of $f$, so that we
can resort to algorithms for accelerated convex minimization. We now show how we can efficiently
smooth the threshold potentials $\Psi_{w,y}(A) = \min(y, w \cdot e_A)$ of Section 3, which are simple enough
to allow efficient smoothing, but rich enough when combined to express a large class of submodular
functions. For $x \ge 0$, the Lovász extension of $\Psi_{w,y}$ is

$$\tilde\Psi_{w,y}(x) = \sup v \cdot x \quad \text{s.t. } v \le w,\; v \cdot e_A \le y \text{ for all } A \in 2^E.$$

Note that when $x \ge 0$, the arg max of the above linear program always contains a point $v$ which
satisfies $v \cdot \mathbf{1} = y$, and $v \ge 0$. So we can restrict the domain of the dual variable $v$ to those points
which satisfy these two conditions, without changing the value of $\tilde\Psi_{w,y}(x)$:

$$\tilde\Psi_{w,y}(x) = \max_{v \in \mathcal{D}(w,y)} v \cdot x \quad \text{where } \mathcal{D}(w,y) = \{v : 0 \le v \le w,\; v \cdot \mathbf{1} = y\}.$$

Restricting the domain of $v$ allows us to define a smoothed Lovász extension (with parameter $\mu$)
that is easily computed:

$$\tilde\Psi^\mu_{w,y}(x) = \max_{v \in \mathcal{D}(w,y)} v \cdot x - \frac{\mu}{2}\|v\|^2$$

To compute the value of this function we need to solve for the optimal vector $v^*$, which is also the
gradient of this function, as we have the following characterization:

$$\nabla\tilde\Psi^\mu_{w,y}(x) = \arg\max_{v \in \mathcal{D}(w,y)} v \cdot x - \frac{\mu}{2}\|v\|^2 = \arg\min_{v \in \mathcal{D}(w,y)} \left\| \frac{x}{\mu} - v \right\|. \quad (4)$$

To derive an expression for $v^*$, we begin by forming the Lagrangian and deriving the dual problem:

$$\tilde\Psi^\mu_{w,y}(x) = \min_{t \in \mathbb{R},\, \lambda_1, \lambda_2 \ge 0} \left[ \max_{v \in \mathbb{R}^n} v \cdot x - \frac{\mu}{2}\|v\|^2 + \lambda_1 \cdot v + \lambda_2 \cdot (w - v) + t(y - v \cdot \mathbf{1}) \right]$$

$$= \min_{t \in \mathbb{R},\, \lambda_1, \lambda_2 \ge 0} \frac{1}{2\mu}\|x - t\mathbf{1} + \lambda_1 - \lambda_2\|^2 + \lambda_2 \cdot w + t y.$$

If we fix $t$, we can solve for the optimal dual variables $\lambda_1^*$ and $\lambda_2^*$ componentwise. By strong duality,
we know the optimal primal variable is given by $v^* = \frac{1}{\mu}(x - t^*\mathbf{1} + \lambda_1^* - \lambda_2^*)$. So we have:

$$\lambda_1^* = \max(t^*\mathbf{1} - x, 0), \quad \lambda_2^* = \max(x - t^*\mathbf{1} - \mu w, 0) \;\Rightarrow\; v^* = \min(\max((x - t^*\mathbf{1})/\mu, 0), w).$$

This expresses $v^*$ as a function of the unknown optimal dual variable $t^*$. For the simple case of
2-potentials, we can solve for $t^*$ explicitly and get a closed form expression:

$$\nabla\tilde\Psi^\mu_{e_{kl},1}(x) = \begin{cases} e_k & \text{if } x[k] \ge x[l] + \mu \\ e_l & \text{if } x[l] \ge x[k] + \mu \\ \frac{1}{2}\left(e_{kl} + \frac{1}{\mu}(x[k] - x[l])(e_k - e_l)\right) & \text{if } |x[k] - x[l]| < \mu \end{cases}$$

However, in general to find $t^*$ we note that $v^*$ must satisfy $v^* \cdot \mathbf{1} = y$. So define $\rho_{x,w}(t)$ as:

$$\rho_{x,w}(t) = \min(\max((x - t\mathbf{1})/\mu, 0), w) \cdot \mathbf{1}$$

Then we note this function is a monotonic continuous piecewise linear function of $t$, so we can use a
simple root-finding algorithm to solve $\rho_{x,w}(t^*) = y$. This root finding procedure will take no more
than $O(n)$ steps in the worst case.
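
The root-finding step can be sketched in a few lines. Note that the version below swaps in plain bisection on $t$ for the breakpoint-based search the text alludes to, so it does not achieve the stated $O(n)$ step bound; it is meant only to make the formula $v^* = \min(\max((x - t^*\mathbf{1})/\mu, 0), w)$ concrete. The function name and the assumption $0 \le y \le w \cdot \mathbf{1}$, $\mu > 0$ are introduced here for illustration.

```python
def smoothed_threshold_gradient(x, w, y, mu, iters=100):
    """Gradient v* of the smoothed extension of Psi_{w,y} at x.

    Finds t* with rho(t*) = y by bisection, where
    rho(t) = sum_k min(max((x[k] - t) / mu, 0), w[k]) is nonincreasing in t.
    Assumes 0 <= y <= sum(w) and mu > 0.
    """
    def v_of(t):
        return [min(max((xk - t) / mu, 0.0), wk) for xk, wk in zip(x, w)]

    lo = min(x) - mu * max(w)   # here every component is clipped at w: rho = sum(w) >= y
    hi = max(x)                 # here every component is clipped at 0: rho = 0 <= y
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(v_of(mid)) > y:
            lo = mid
        else:
            hi = mid
    return v_of(0.5 * (lo + hi))


# 2-potential (w = e_kl, y = 1): compare with the closed form in the text.
v_close = smoothed_threshold_gradient([0.7, 0.5], [1.0, 1.0], 1.0, mu=0.4)
v_far = smoothed_threshold_gradient([0.9, 0.1], [1.0, 1.0], 1.0, mu=0.4)
```

For the two inputs above, the closed-form expression for 2-potentials gives $\frac{1}{2}(e_{kl} + 0.5(e_k - e_l))$ in the first case (components within $\mu$ of each other) and $e_k$ in the second.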

4.2 The SLG Algorithm for Minimizing Sums of Threshold Potentials 

Stepping beyond a single threshold potential, we now assume that the submodular function to be
minimized can be written as a nonnegative linear combination of threshold potentials and a modular
function, i.e.,

$$f(A) = c \cdot e_A + \sum_j d_j \Psi_{w_j,y_j}(A).$$

Thus, we have the smoothed Lovász extension, and its gradient:

$$\tilde f^\mu(x) = c \cdot x + \sum_j d_j \tilde\Psi^\mu_{w_j,y_j}(x) \quad \text{and} \quad \nabla\tilde f^\mu(x) = c + \sum_j d_j \nabla\tilde\Psi^\mu_{w_j,y_j}(x).$$

We now wish to use the accelerated gradient descent algorithm of [18] to minimize this function.
This algorithm requires that the smoothed objective has a Lipschitz continuous gradient. That is, for
some constant $L$, it must hold that $\|\nabla\tilde f^\mu(x_1) - \nabla\tilde f^\mu(x_2)\| \le L\|x_1 - x_2\|$, for all $x_1, x_2 \in \mathbb{R}^n$.
Fortunately, by construction, the smoothed threshold extensions $\tilde\Psi^\mu_{w_j,y_j}(x)$ all have Lipschitz
gradient with constant $1/\mu$, a direct consequence of the characterization in Equation 4. Hence we have
a loose upper bound for the Lipschitz constant of $\tilde f^\mu$: $L \le D/\mu$, where $D = \sum_j d_j$. Furthermore,
the smoothed threshold extensions approximate the threshold extensions uniformly:
$|\tilde\Psi^\mu_{w_j,y_j}(x) - \tilde\Psi_{w_j,y_j}(x)| \le \mu/2$ for all $x$, so $|\tilde f^\mu(x) - \tilde f(x)| \le \mu D/2$.

One way to use the smoothed gradient is to specify an accuracy $\epsilon$, then minimize $\tilde f^\mu$ for sufficiently
small $\mu$ to guarantee that the solution will also be an approximate minimizer of $\tilde f$. Then we simply
apply the accelerated gradient descent algorithm of [18]. See also [3] for a description. Let
$P_C(x) = \arg\min_{x' \in C} \|x - x'\|$ be the projection of $x$ onto the convex set $C$. In particular,
$P_{[0,1]^n}(x) = \min(\max(x, 0), \mathbf{1})$. Algorithm 1 formalizes our Smoothed Lovász Gradient (SLG) algorithm:

Algorithm 1: SLG: Smoothed Lovász Gradient

Input: Accuracy $\epsilon$; decomposable function $f$.
begin
    $\mu = \frac{\epsilon}{2D}$, $L = \frac{D}{\mu}$, $x_{-1} = z_{-1} = \frac{1}{2}\mathbf{1}$;
    for $t = 0, 1, 2, \dots$ do
        $g_t = \nabla\tilde f^\mu(x_{t-1})/L$; $z_t = P_{[0,1]^n}\left(z_{-1} - \sum_{s=0}^{t} \frac{s+1}{2}\, g_s\right)$; $y_t = P_{[0,1]^n}(x_{t-1} - g_t)$;
        if $\mathrm{gap}_t \le \epsilon/2$ then stop;
        $x_t = (2 z_t + (t+1)\, y_t)/(t+3)$;
    $x_\epsilon = y_t$;
Output: $\epsilon$-optimal $x_\epsilon$ to $\min_{x \in [0,1]^n} \tilde f(x)$

The optimality gap of a smooth convex function at the iterate $y_t$ can be computed from its gradient:

$$\mathrm{gap}_t = \max_{x \in [0,1]^n} (y_t - x) \cdot \nabla\tilde f^\mu(y_t) = y_t \cdot \nabla\tilde f^\mu(y_t) + \max(-\nabla\tilde f^\mu(y_t), 0) \cdot \mathbf{1}.$$

In summary, as a consequence of the results of [18], we have the following guarantee about SLG:
Theorem 1 SLG is guaranteed to provide an $\epsilon$-optimal solution after running for $O(D/\epsilon)$ iterations.
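
To make the iteration concrete, here is a small pure-Python sketch of the accelerated scheme for a sum of threshold potentials. It is an illustration under several assumptions rather than the authors' implementation: potentials are encoded as hypothetical tuples `(d, w, y)` with $y \le 1$, the inner root-finding uses bisection instead of an $O(n)$ breakpoint search, the query point for $y_t$ is taken to be the same point at which $g_t$ is evaluated, and all function names are introduced here. The final step picks the best level set of the continuous solution.

```python
def project_box(x):
    """Projection onto the unit cube [0, 1]^n."""
    return [min(max(xi, 0.0), 1.0) for xi in x]


def threshold_grad(x, w, y, mu, iters=60):
    """v* = min(max((x - t*) / mu, 0), w) with sum(v*) = y, t* by bisection."""
    v_of = lambda t: [min(max((xk - t) / mu, 0.0), wk) for xk, wk in zip(x, w)]
    lo, hi = min(x) - mu * max(w), max(x)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if sum(v_of(mid)) > y:
            lo = mid
        else:
            hi = mid
    return v_of(0.5 * (lo + hi))


def smoothed_grad(x, c, potentials, mu):
    """Gradient of c . x + sum_j d_j * (smoothed threshold extension)."""
    g = list(c)
    for d, w, y in potentials:
        v = threshold_grad(x, w, y, mu)
        g = [gi + d * vi for gi, vi in zip(g, v)]
    return g


def slg(c, potentials, eps, max_iter=1000):
    """Accelerated projected gradient on the smoothed extension."""
    n = len(c)
    D = sum(d for d, _, _ in potentials)
    mu = eps / (2.0 * D)
    L = D / mu
    x = [0.5] * n      # current query point, starts at the cube's center
    z0 = [0.5] * n     # fixed starting point used in the z-update
    acc = [0.0] * n    # running sum of ((s + 1) / 2) * g_s
    y_t = x
    for t in range(max_iter):
        g = [gi / L for gi in smoothed_grad(x, c, potentials, mu)]
        acc = [a + 0.5 * (t + 1) * gi for a, gi in zip(acc, g)]
        z = project_box([zi - a for zi, a in zip(z0, acc)])
        y_t = project_box([xi - gi for xi, gi in zip(x, g)])
        grad_y = smoothed_grad(y_t, c, potentials, mu)
        gap = sum(yi * gi + max(-gi, 0.0) for yi, gi in zip(y_t, grad_y))
        if gap <= eps / 2.0:
            break
        x = [(2.0 * zi + (t + 1) * yi) / (t + 3) for zi, yi in zip(z, y_t)]
    return y_t


def evaluate(A, c, potentials):
    """f(A) = c . e_A + sum_j d_j * min(y_j, w_j . e_A)."""
    total = sum(c[k] for k in A)
    for d, w, y in potentials:
        total += d * min(y, sum(w[k] for k in A))
    return total


def round_to_set(x, c, potentials):
    """Best level set {sigma(1..k)} of x, k = 0..n."""
    order = sorted(range(len(x)), key=lambda k: -x[k])
    return min((order[:k] for k in range(len(x) + 1)),
               key=lambda S: evaluate(S, c, potentials))


# Tiny illustrative problem (made-up numbers).
c = [-1.5, 0.4, -1.2, 0.5]
potentials = [(0.5, [1.0, 1.0, 0.0, 0.0], 1.0),
              (0.5, [0.0, 1.0, 1.0, 0.0], 1.0),
              (0.5, [0.0, 0.0, 1.0, 1.0], 1.0)]
eps = 0.1
x_eps = slg(c, potentials, eps)
A_hat = round_to_set(x_eps, c, potentials)
```

On a problem this small the result can be compared against exhaustive enumeration of all $2^n$ subsets, which is how the sketch is best validated before trying larger inputs.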
SLG is only guaranteed to provide an $\epsilon$-optimal solution to the continuous optimization problem.
Fortunately, once we have an $\epsilon$-optimal point for the Lovász extension, we can efficiently round it to
a set which is $\epsilon$-optimal for the original submodular function using Alg. 2 (see [9] for more details).

Algorithm 2: Set generation by rounding the continuous solution

Input: Vector $x \in [0,1]^n$; submodular function $f$.
begin
    By sorting, find any permutation $\sigma$ satisfying: $x[\sigma(1)] \ge \dots \ge x[\sigma(n)]$;
    $S_k = \{\sigma(1), \dots, \sigma(k)\}$; $K^* = \arg\min_{k \in \{0,1,\dots,n\}} f(S_k)$; $C = \{S_k : k \in K^*\}$;
Output: Collection of sets $C$, such that $f(A) \le \tilde f(x)$ for all $A \in C$
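
Algorithm 2 amounts to a sort followed by $n + 1$ function evaluations. A minimal sketch, assuming 0-based indices, a set function that accepts a `frozenset`, and a hypothetical function name `round_to_sets`:

```python
def round_to_sets(f, x):
    """Return (best value, list of minimizing level sets S_k) for the point x."""
    n = len(x)
    order = sorted(range(n), key=lambda k: -x[k])          # permutation sigma
    level_sets = [frozenset(order[:k]) for k in range(n + 1)]  # S_0, ..., S_n
    values = [f(S) for S in level_sets]
    best = min(values)
    return best, [S for S, v in zip(level_sets, values) if v == best]


# Illustrative function: modular term plus one threshold potential.
c = [-1.0, 0.3, -0.2]
f = lambda A: sum(c[k] for k in A) + 0.5 * min(1, len(A & {1, 2}))
best_value, best_sets = round_to_sets(f, [0.9, 0.2, 0.6])
```

Here the level sets are $\emptyset$, $\{0\}$, $\{0,2\}$ and $\{0,1,2\}$, and the function returns every level set that attains the smallest value among them.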



4.3 Early Stopping based on Discrete Certificates of Optimality 

In general, if the minimum of $f$ is not unique, the output of SLG may be in the interior of the unit
cube. However, if $f$ admits a unique minimum $A^*$, then the iterates will tend toward the corner
$e_{A^*}$. If a trend like this is observed, one natural question is whether it is necessary to wait for the
iterates to converge all the way to the optimal solution of the continuous problem $\min_{x \in [0,1]^n} \tilde f(x)$,
when one is actually interested in solving the discrete problem $\min_{A \in 2^E} f(A)$. Below, we show that
it is possible to use information about the current iterates to check optimality of a set and terminate
the algorithm before the continuous problem has converged.

To prove optimality of a candidate set $A$, we can use a subgradient of $\tilde f$ at $e_A$. If $g \in \partial\tilde f(e_A)$, then
we can compute an optimality gap:

$$f(A) - f^* \le \max_{x \in [0,1]^n} (e_A - x) \cdot g = \sum_{k=1}^{n} \max(0, g[k](e_A[k] - e_{E \setminus A}[k])). \quad (5)$$

In particular if $g[k] \le 0$ for $k \in A$ and $g[k] \ge 0$ for $k \in E \setminus A$, then $A$ is optimal. But if we only
have knowledge of the candidate set $A$, then finding a subgradient $g \in \partial\tilde f(e_A)$ which demonstrates
optimality may be extremely difficult, as the set of subgradients is a polyhedron with exponentially
many extreme points. But our algorithm naturally suggests the subgradient we could use; the gradient
of the smoothed extension is one such subgradient, provided a certain condition is satisfied, as
described in the following Lemma.

Lemma 1 Suppose $f$ is a decomposable submodular function, with Lovász extension $\tilde f$, and
smoothed extension $\tilde f^\mu$ as in the previous section. Suppose $x \in \mathbb{R}^n$ and $A \in 2^E$ satisfy the
following property:

$$\min_{k \in A,\, l \in E \setminus A} x[k] - x[l] \ge 2\mu$$

Then $\nabla\tilde f^\mu(x) \in \partial\tilde f(e_A)$.

This is a consequence of our formula for $\nabla\tilde f^\mu$, but see the appendix for a detailed proof. Lemma 1
states that if the components of point $x$ corresponding to elements of $A$ are all larger than all the
other components by at least $2\mu$, then the gradient at $x$ is a subgradient for $\tilde f$ at $e_A$ (which by
Equation 5 allows us to compute an optimality gap). In practice, this separation of components
naturally occurs as the iterates move in the direction of the point $e_A$, long before they ever actually
reach the point $e_A$. But even if the components are not separated, we can easily add a positive
multiple of $e_A$ to separate them and then compute the gradient there to get an optimality gap. In
summary, we have the following algorithm to check the optimality of a candidate set:

Algorithm 3: Set Optimality Check

Input: Set $A$; decomposable function $f$; scale $\mu$; $x \in \mathbb{R}^n$.
begin
    $\gamma = 2\mu + \max_{k \in A,\, l \in E \setminus A} (x[l] - x[k])$; $g = \nabla\tilde f^\mu(x + \gamma e_A)$;
    $\mathrm{gap} = \sum_{k \in E} \max(0, g[k](e_A[k] - e_{E \setminus A}[k]))$;
Output: gap, which satisfies $\mathrm{gap} \ge f(A) - f^*$

Of critical importance is how to choose the candidate set $A$. But by Equation 5, for a set to be optimal, we
want the components of the gradient $\nabla\tilde f^\mu(x + \gamma e_A)[k]$ to be negative for $k \in A$ and positive for
$k \in E \setminus A$. So it is natural to choose $A = \{k : \nabla\tilde f^\mu(x)[k] \le 0\}$. Thus, if adding $\gamma e_A$ does not
change the signs of the components of the gradient, then in fact we have found the optimal set. This
stopping criterion is very effective in practice, and we use it in all of our experiments.
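
The gap bound of Equation 5 works with any subgradient of the Lovász extension at $e_A$. The sketch below deliberately uses a simpler choice than Algorithm 3: instead of the smoothed gradient at $x + \gamma e_A$, it takes the greedy subgradient obtained by ordering the elements of $A$ first. This is a swapped-in technique for illustration; the function names, the tuple encoding `(d, w, y)` of the potentials, and the example numbers are assumptions.

```python
def greedy_subgradient_at_set(f, A, n):
    """A subgradient of the Lovasz extension at e_A (elements of A ordered first).

    Assumes f(empty set) = 0.
    """
    order = sorted(A) + [k for k in range(n) if k not in A]
    g = [0.0] * n
    prefix, f_prev = frozenset(), 0.0
    for k in order:
        prefix = prefix | {k}
        f_curr = f(prefix)
        g[k] = f_curr - f_prev
        f_prev = f_curr
    return g


def optimality_gap(g, A, n):
    """Upper bound on f(A) - f* from Equation 5."""
    return sum(max(0.0, g[k] if k in A else -g[k]) for k in range(n))


# Illustrative decomposable function (made-up numbers).
c = [-1.5, 0.4, -1.2, 0.5]
potentials = [(0.5, [1.0, 1.0, 0.0, 0.0], 1.0),
              (0.5, [0.0, 1.0, 1.0, 0.0], 1.0),
              (0.5, [0.0, 0.0, 1.0, 1.0], 1.0)]


def f(A):
    total = sum(c[k] for k in A)
    for d, w, y in potentials:
        total += d * min(y, sum(w[k] for k in A))
    return total


gap_good = optimality_gap(greedy_subgradient_at_set(f, {0, 2}, 4), {0, 2}, 4)
gap_poor = optimality_gap(greedy_subgradient_at_set(f, {0}, 4), {0}, 4)
```

A gap of zero certifies optimality of the candidate set; a positive gap only bounds the suboptimality from above, and a different subgradient (such as the smoothed gradient used in Algorithm 3) may give a tighter certificate.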
[Figure 1 panels: (a) Example Regions for Potentials; (b) Results for genrmf-long; (c) Results for genrmf-wide. Horizontal axis in (b) and (c): Problem Size (n).]

Figure 1: (a) Example regions used for our higher-order potential functions. (b-c) Comparison of
running times of submodular minimization algorithms on synthetic problems from DIMACS [1].

5 Extension to General Concave Potentials 

To extend our algorithm to work on general concave functions, we note that an arbitrary concave
function can be expressed as an integral of threshold potential functions. This is a simple
consequence of integration by parts, which we state in the following lemma:

Lemma 2 For $\phi \in C^2([0,T])$,

$$\phi(x) = \phi(0) + \phi'(T)\,x - \int_0^T \min(x, y)\,\phi''(y)\,dy, \quad \forall x \in [0,T]$$

This means that for a general sum of concave potentials as in Equation (3), we have:

$$f(A) = c \cdot e_A + \sum_j \left( \phi_j(0) + \phi_j'(w_j \cdot \mathbf{1})\, w_j \cdot e_A - \int_0^{w_j \cdot \mathbf{1}} \Psi_{w_j,y}(A)\,\phi_j''(y)\,dy \right)$$

Then we can define $\tilde f$ and $\tilde f^\mu$ by replacing $\Psi$ with $\tilde\Psi$ and $\tilde\Psi^\mu$ respectively. Our SLG algorithm is
essentially unchanged, the conditions for optimality still hold, and so on. Conceptually, we just use
a different smoothed gradient, but calculating it is more involved. We need to compute the integrals
of the form $\int \nabla\tilde\Psi^\mu_{w,y}(x)\,\phi''(y)\,dy$. Since $\nabla\tilde\Psi^\mu_{w,y}(x)$ is a piecewise linear function with respect to $y$
which we can compute, we can evaluate the integral by parts so that we need only evaluate $\phi$, but
not its derivatives. We leave this formula for the appendix.
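
Lemma 2 can be checked numerically for a specific smooth concave function. The sketch below approximates the integral with a midpoint rule; the choice $\phi(y) = \sqrt{1 + y}$, the interval length, and the function name `lemma2_rhs` are assumptions made purely for illustration.

```python
import math


def lemma2_rhs(phi, dphi, d2phi, x, T, steps=20000):
    """phi(0) + phi'(T) x - integral_0^T min(x, y) phi''(y) dy (midpoint rule)."""
    h = T / steps
    integral = 0.0
    for i in range(steps):
        y = (i + 0.5) * h                     # midpoint of the i-th cell
        integral += min(x, y) * d2phi(y) * h
    return phi(0.0) + dphi(T) * x - integral


# A smooth concave example and its first two derivatives.
phi = lambda y: math.sqrt(1.0 + y)
dphi = lambda y: 0.5 / math.sqrt(1.0 + y)
d2phi = lambda y: -0.25 * (1.0 + y) ** -1.5

approx = lemma2_rhs(phi, dphi, d2phi, x=1.3, T=3.0)
```

The value `approx` should agree with `phi(1.3)` up to the discretization error of the quadrature.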

6 Experiments 

Synthetic Data. We reproduce the experimental setup of [8] designed to compare submodular 
minimization algorithms. Our goal is to find the minimum cut of a randomly generated graph (which 
requires submodular minimization of a sum of 2-potentials) with the graph generated by the speci- 
fications in [1]. We compare against the state of the art combinatorial algorithms (LEX2, HYBRID, 
SFM3, PR [6]) that are guaranteed to find the exact solution in polynomial time, as well as the 
Minimum Norm algorithm of [8], a practical alternative with unknown running time. Figures 1(b) 
and 1(c) compare the running time of SLG against the running times reported in [8]. In some cases, 
SLG was 6 times faster than the MinNorm algorithm. However the comparison to the MinNorm 
algorithm is inconclusive in this experiment, since while we used a faster machine, we also used a 
simple MATLAB implementation. What is clear is that SLG scales at least as well as MinNorm on 
these problems, and is practical for problem sizes that the combinatorial algorithms cannot handle. 

Image Segmentation Experiments. We also tested our algorithm on the joint image 
segmentation-and-classification task introduced in Section 3. We used an implementation of 
TextonBoost [20], then trained on and tested subsampled images from [5]. As seen in Figures 2(e) 
and 2(g), using only the per-pixel score from our TextonBoost implementation gets the general area 
of the object, but does not do a good job of identifying the shape of a classified object. Compare 
to the ground truth in Figures 2(b) and 2(d). We then perform MAP inference in a Markov Random 
Field with 2-potentials (as done in [20]). While this regularization, as shown in Figures 2(f) and 
2(h), leads to improved performance, it still performs poorly on classifying the boundary. 



[Figure 2 panels: (a) Original Image; (b) Ground Truth; (c) Original Image; (d) Ground Truth; (i) Concave Potentials; (j) Continuous; (k) Concave Potentials; (l) Continuous]

Figure 2: Segmentation experimental results

Finally, we used SLG to regularize with higher order potentials. To generate regions for our potentials,
we randomly picked seed pixels and grew the regions based on HSV channels of the image.
We picked our seed pixels with a preference for pixels which were included in the least number of
previously generated regions. Figure 1(a) shows what the regions typically looked like. For our
experiments, we used 90 total regions. We used SLG to minimize $f(A) = c \cdot e_A + \sum_j |A \cap R_j|\,|R_j \setminus A|$,
where $c$ was the output from TextonBoost, scaled appropriately. Figures 2(i) and 2(k) show the
classification output. The continuous variables $x$ at the end of each run are shown in Figures 2(j) and
2(l); while it has no formal meaning, in general one can interpret a very high or low value of $x[k]$
to correspond to high confidence in the classification of the pixel $k$. To generate the result shown in
Figure 2(k), a problem with $10^4$ variables and 90 concave potentials, our MATLAB/mex implementation
of SLG took 71.4 seconds. In comparison, the MinNorm implementation of the SFO toolbox
[15] gave the same result, but took 6900 seconds. Similar problems on an image of twice the
resolution ($4 \times 10^4$ variables) were tested using SLG, resulting in runtimes of roughly 1600 seconds.

7 Conclusion 

We have developed a novel method for efficiently minimizing a large class of submodular functions 
of practical importance. We do so by decomposing the function into a sum of threshold potentials, 
whose Lovász extensions are convenient for using modern smoothing techniques of convex
optimization. This allows us to solve submodular minimization problems with thousands of variables,
that cannot be expressed using only pairwise potentials. Thus we have achieved a middle ground 
between graph-cut-based algorithms which are extremely fast but only able to handle very specific 
types of submodular minimization problems, and combinatorial algorithms which assume nothing 
but submodularity but are impractical for large-scale problems. 
A Submodularity of Decomposable Functions

Since the sum of submodular functions is submodular, we need only prove the submodularity
of $f(A) = \phi(w \cdot e_A)$, where $\phi$ is an arbitrary concave function on $\mathbb{R}$ and $w \ge 0$.

By definition of concavity, for all $\theta \in [0,1]$, we have:

$$\phi(\theta(y + h) + (1 - \theta)x) + \phi((1 - \theta)(y + h) + \theta x) \ge \phi(y + h) + \phi(x)$$

If $x \le y$ and $h > 0$, then setting $\theta = h/(y - x + h)$ in the above gives us:

$$\phi(x + h) - \phi(x) \ge \phi(y + h) - \phi(y) \quad (6)$$

(for $h = 0$ both sides of Eq. 6 are zero). Then, for all $k \in E \setminus A$, we compute the discrete derivative $\Delta_k f(A)$:

$$\Delta_k f(A) = \phi(w \cdot e_A + w[k]) - \phi(w \cdot e_A) \quad (7)$$

So if $A \subseteq B \subseteq E$ and $k \in E \setminus B$, then $w \cdot e_A \le w \cdot e_B$, so by Eqs. 6 and 7, $\Delta_k f(A) \ge \Delta_k f(B)$,
and hence $f$ is submodular.



B Reformulation of Set Cover Functions 



A set cover function can be formulated as the function:

$$f(A) = \left| \bigcup_{i \in A} B_i \right|$$

where the $B_i$ are subsets of some base set $F$, and the $B_i$ form some collection of subsets indexed by $E$.
For every $k \in F$, we define the vectors $w_k \in \mathbb{R}^{|E|}$ as follows:

$$w_k[i] = \begin{cases} 0 & k \notin B_i \\ 1 & k \in B_i \end{cases}$$

We claim:

$$f(A) = \sum_{k \in F} \min(1, w_k \cdot e_A)$$

The $k$th term in the sum equals 1 if $k \in B_i$ for some $i \in A$ and 0 otherwise. The sum of all such
terms will give the cardinality of the union of the $B_i$ with $i \in A$, which is exactly the set cover
function.
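
The claim can be confirmed on a toy instance by comparing the direct definition with the sum of threshold potentials. The collection `B`, the base set `F`, and the function names below are made up for illustration.

```python
def set_cover(B, A):
    """|union of B[i] for i in A|, the set cover function."""
    covered = set()
    for i in A:
        covered |= B[i]
    return len(covered)


def set_cover_as_thresholds(B, F, A):
    """sum over k in F of min(1, w_k . e_A), with w_k[i] = 1 iff k in B[i]."""
    total = 0
    for k in F:
        w_k = [1 if k in B_i else 0 for B_i in B]
        total += min(1, sum(w_k[i] for i in A))
    return total


# Toy instance: E indexes four subsets of F = {1, ..., 5}.
B = [{1, 2}, {2, 3}, {4}, {1, 4, 5}]
F = {1, 2, 3, 4, 5}
```

Each element of $F$ contributes one threshold potential with $y = 1$, so the reformulation uses $|F|$ terms regardless of how many subsets cover a given element.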



C Strict Generality of Threshold Potentials 

As mentioned in the text, any concave cardinality function can be decomposed into the sum of
several threshold potentials. This is effectively the discrete version of Lemma 2:

$$\phi(|A \cap S|) = \phi(0) + (\phi(|S|) - \phi(|S| - 1))\, e_S \cdot e_A + \sum_{k=1}^{|S|-1} (2\phi(k) - \phi(k-1) - \phi(k+1)) \min(k, e_S \cdot e_A)$$

Since $\phi$ is concave, the coefficients $(2\phi(k) - \phi(k-1) - \phi(k+1))$ are nonnegative. So without
loss of generality, any sum of concave cardinality functions can be expressed as a sum of a modular
function and nonnegative linear combination of threshold potentials:

$$\sum_j \phi_j(|R_j \cap A|) = c \cdot e_A + \sum_{S_l \subseteq E,\, |S_l| > 1} \; \sum_{k=1}^{|S_l|-1} d_{kl} \min(k, e_{S_l} \cdot e_A)$$
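
The single-potential decomposition above depends on $A$ only through $m = |A \cap S|$, so it can be checked by looping over $m$. The sketch below uses the quadratic potential $\phi(m) = m(|S| - m)$ from the segmentation experiments as the test function; the function name and the choice $|S| = 6$ are illustrative assumptions.

```python
def concave_cardinality_decomposed(phi, s, m):
    """Evaluate the threshold decomposition at |A intersect S| = m, with |S| = s."""
    modular = (phi(s) - phi(s - 1)) * m          # (phi(|S|) - phi(|S|-1)) e_S . e_A
    thresholds = sum(
        (2 * phi(k) - phi(k - 1) - phi(k + 1)) * min(k, m)
        for k in range(1, s)                     # k = 1, ..., |S| - 1
    )
    return phi(0) + modular + thresholds


s = 6
phi = lambda m: m * (s - m)                      # |A ∩ R| * |R \ A| as a function of m
coefficients = [2 * phi(k) - phi(k - 1) - phi(k + 1) for k in range(1, s)]
```

For this quadratic potential every threshold coefficient equals 2, consistent with the nonnegativity required by concavity.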

There are $\sum_{m=2}^{n} \binom{n}{m}(m - 1)$ coefficients $d_{kl}$ and they all must be nonnegative. So to check if a
submodular function $f(A)$ can be expressed as such a sum, we can just write out the $2^n$ constraints
for each subset:

$$f(A) = c \cdot e_A + \sum_{S_l \subseteq E,\, |S_l| > 1} \; \sum_{k=1}^{|S_l|-1} d_{kl} \min(k, e_{S_l} \cdot e_A) \quad \text{for all } A \in 2^E \quad (8)$$

If $n = 4$, we have $2^4$ linear constraints, 4 unconstrained variables from $c$, and 19 nonnegative
variables $d_{kl}$. This is small enough that one can check for feasibility using a linear algebra package.
We discovered that the simple threshold potential $f(A) = \min(y, w \cdot e_A)$ with $w = [1, 2, 3, 4]/4$ and
$y = 1$ does not have a feasible solution to Eq. 8.
D Reformulation of a Class of Functions

Another example of decomposable functions are the problems under consideration in [10], which
are of the following form:

$$f(A) = c \cdot e_A + (u \cdot e_A)\,\phi(v \cdot e_A)$$

where $u, v$ are nonnegative weight vectors and $\phi$ is a nonincreasing concave function. Suppose we
can choose vectors $w_j$ and a concave function $\hat\phi$ to satisfy:

$$\hat\phi(w_j \cdot e_A) = \begin{cases} \phi(v \cdot e_A) - \phi(0) & j \in A \\ 0 & j \notin A \end{cases} \quad (9)$$

Then we claim the following is an equivalent formulation for $f$ in decomposable form:

$$f(A) = (c + \phi(0)u) \cdot e_A + \sum_{j=1}^{n} u[j]\,\hat\phi(w_j \cdot e_A) \quad (10)$$

Indeed, plugging Eq. 9 into the above gives:

$$(c + \phi(0)u) \cdot e_A + \sum_{j \in A} u[j]\,(\phi(v \cdot e_A) - \phi(0)) = c \cdot e_A + (u \cdot e_A)\,\phi(v \cdot e_A) = f(A)$$

To satisfy Eq. 9 we define $\hat\phi$ as follows:

$$\hat\phi(t) = \begin{cases} 0 & \text{if } t \le \mathbf{1} \cdot v \\ \phi(t - \mathbf{1} \cdot v) - \phi(0) & \text{if } t > \mathbf{1} \cdot v \end{cases}$$

And let $w_j = v + (\mathbf{1} \cdot v)e_j$. It is straightforward to check that these definitions satisfy Eq. 9. Note
$\hat\phi$ is concave because $\phi$ is nonincreasing concave. Incidentally, the decomposition in Eq. 10 proves
that $f$ is submodular.



E Proof of Lemma 1 

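By linearity, it is sufficient to consider the case $f = \Psi_{w,y}$. First we claim that if the hypothesis of the
Lemma holds, adding a positive multiple of $e_A$ will not change the gradient. That is,
$\nabla\tilde\Psi^\mu_{w,y}(x) = \nabla\tilde\Psi^\mu_{w,y}(x + \alpha e_A)$ for $\alpha \ge 0$. Recall the formula for the gradient:

$$\nabla\tilde\Psi^\mu_{w,y}(x) = \min(\max((x - t^*\mathbf{1})/\mu, 0), w)$$

where $t^*$ satisfies $\min(\max((x - t^*\mathbf{1})/\mu, 0), w) \cdot \mathbf{1} = y$.

Consider the effect of adding $\alpha e_A$ to $x$ in this formula; either $t^*$ is increased by $\alpha$ or it is unchanged;
in either case the gradient itself is unchanged. Next, note the following scale relationship which
follows directly from the definition of $\tilde\Psi^\mu_{w,y}$ (the gradient is the projection of $x/\mu$ onto $\mathcal{D}(w,y)$), for $\alpha > 0$:

$$\nabla\tilde\Psi^\mu_{w,y}(x) = \nabla\tilde\Psi^{\mu/\alpha}_{w,y}(x/\alpha)$$

But combined with our first observation this implies

$$\nabla\tilde\Psi^\mu_{w,y}(x) = \nabla\tilde\Psi^{\mu/\alpha}_{w,y}(x/\alpha + e_A)$$

But the right-hand side of that equation must converge to a subgradient of the nonsmooth function
as $\alpha \to \infty$:

$$\lim_{\alpha \to \infty} \nabla\tilde\Psi^{\mu/\alpha}_{w,y}(x/\alpha + e_A) \in \partial\tilde\Psi_{w,y}(e_A)$$

which gives the result.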
F Proof of Lemma 2

This is a straightforward calculation:

$$\begin{aligned}
\int_0^T \min(x, y)\,\phi''(y)\,dy &= \int_0^x y\,\phi''(y)\,dy + \int_x^T x\,\phi''(y)\,dy \\
&= \left( y\,\phi'(y) - \phi(y) \right)\Big|_0^x + x\,\phi'(y)\Big|_x^T \\
&= x\,\phi'(x) - \phi(x) + \phi(0) + x\,\phi'(T) - x\,\phi'(x) \\
&= \phi(0) + x\,\phi'(T) - \phi(x)
\end{aligned}$$

Intuitively, this is a consequence of integration by parts and the fact that $\frac{d^2}{dy^2} \min(x, y) = -\delta(x - y)$
(the Dirac delta).

G General Smoothed Gradient Formula 

Let $f(A) = \phi(w \cdot e_A)$ be a general concave potential. For ease of notation, in the following let
$g(y) = \nabla\tilde\Psi^\mu_{w,y}(x)$ be the gradient of the smoothed extension of a threshold potential. Then by
Lemma 2, we have this formula for the gradient of the smoothed extension of $f$:

$$\nabla\tilde f^\mu(x) = \phi'(w \cdot \mathbf{1})\,w - \int_0^{w \cdot \mathbf{1}} g(y)\,\phi''(y)\,dy$$

Note that $g$ is a piecewise linear function of $y$. Let the intervals $[y_{i-1}, y_i]$ with $0 = y_0 < \dots <
y_N = w \cdot \mathbf{1}$ be the intervals that $g$ is linear on. Let $\theta_i = g(y_i)$, so then $0 = \theta_0 \le \dots \le \theta_N = w$.
Finally let $g_i(y)$ be the linear functions that $g$ equals on these intervals:

$$g(y) = g_i(y) \text{ for } y \in [y_{i-1}, y_i]$$

Denote by $g_i' = (\theta_i - \theta_{i-1})/(y_i - y_{i-1})$ the vector which is the derivative of $g_i(y)$ with respect to $y$.

So then our smoothed gradient can be evaluated:

$$\begin{aligned}
\nabla\tilde f^\mu(x) &= \phi'(y_N)\,w - \sum_{i=1}^{N} \int_{y_{i-1}}^{y_i} g_i(y)\,\phi''(y)\,dy \\
&= \phi'(y_N)\,w + \sum_{i=1}^{N} \left( g_i'\,\phi(y) - g_i(y)\,\phi'(y) \right)\Big|_{y_{i-1}}^{y_i} \\
&= \phi'(y_N)\,w + \sum_{i=1}^{N} g_i'\,(\phi(y_i) - \phi(y_{i-1})) - \sum_{i=1}^{N} \left( \theta_i\,\phi'(y_i) - \theta_{i-1}\,\phi'(y_{i-1}) \right) \\
&= \sum_{i=1}^{N} \frac{(g(y_i) - g(y_{i-1}))\,(\phi(y_i) - \phi(y_{i-1}))}{y_i - y_{i-1}}
\end{aligned}$$

Note there are at most $2n$ points $y_i$, and they can be found all in $O(n \log n)$ time, since it requires a
sort. So the overall operation count of evaluating this formula is $O(n^2)$ since it requires adding up
$O(n)$ $n$-dimensional vectors.